
Remove redundant _filter_ingested from streaming pipeline#271

Merged
gvanrossum merged 1 commit into microsoft:main from
KRRT7:fix/remove-filter-ingested
May 5, 2026

Conversation

@KRRT7 (Contributor) commented May 5, 2026

Summary

  • Removes _filter_ingested method and all per-batch are_sources_ingested DB queries from add_messages_streaming
  • The streaming API is now a dumb pipe — callers handle dedup in their generators
  • ingest_email.py already pre-filters via is_source_ingested before yielding
  • ingest_vtt.py enforces a fresh DB (refuses to run if it exists)
  • ingest_podcast.py uses unique source_ids by construction
  • Removes dead batch_skipped counter from ingest_email.py
  • Net -213 lines

Why: on a 10k-email re-ingest with batch_size=100, the framework was issuing ~100 are_sources_ingested queries that always returned empty sets (because the caller already filtered). Pure waste.
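The "dumb pipe" shape described above can be sketched as a batcher that consumes whatever the caller yields, with no dedup queries of its own. This is an illustrative sketch, not the real `add_messages_streaming` signature:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(messages: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    # Dumb-pipe batching: yield fixed-size batches of whatever the caller
    # sends, issuing no are_sources_ingested (or any other) dedup queries.
    # Filtering already happened upstream, in the caller's generator.
    it = iter(messages)
    while batch := list(islice(it, batch_size)):
        yield batch
```

On a 10k-message stream with `batch_size=100` this yields 100 batches and performs zero DB round-trips, versus one `are_sources_ingested` query per batch before this change.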

Stack

This is the base of a 3-PR stack. Merge order: #271 → #267 → #268

| # | PR | Branch | Description |
|---|---|---|---|
| 1 | **#271** (this) | `fix/remove-filter-ingested` | Remove framework dedup |
| 2 | #267 | `feat/vtt-streaming-ingestion` | VTT streaming migration |
| 3 | #268 | `refactor/email-dedup-consolidation` | Email bulk dedup |

Test plan

  • make check passes (pyright 0 errors)
  • pytest tests/ passes (696 tests; 5 tests that covered only framework-level dedup were removed)
  • Removed tests verified as testing only the deleted behavior (pre-marking source_ids then expecting framework to skip them)

Closes #269

The framework no longer does per-batch dedup queries inside
add_messages_streaming. Callers are responsible for filtering
duplicates before yielding messages into the stream.

- ingest_email.py already pre-filters via is_source_ingested
- ingest_vtt.py enforces a fresh DB (refuses existing)
- podcast_ingest.py uses unique source_ids by construction

This eliminates ~N unnecessary are_sources_ingested DB round-trips
(one per batch) that always returned empty sets.

Closes microsoft#269
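The caller-side contract the commit describes can be sketched as a pre-filtering generator. `Entry`, `FakeStore`, and the method names here are stand-ins for the real ingest_email.py / storage-provider API, used only to show the shape:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Entry:
    source_id: str

class FakeStore:
    # Stand-in for the real storage provider; holds already-ingested ids.
    def __init__(self, ingested: set[str]) -> None:
        self._ingested = ingested

    def is_source_ingested(self, source_id: str) -> bool:
        return source_id in self._ingested

def iter_new_entries(
    entries: Iterable[Entry], store: FakeStore
) -> Iterator[Entry]:
    # Caller-side dedup: skip known sources *before* yielding, so the
    # streaming pipeline downstream never needs its own filtering pass.
    for entry in entries:
        if not store.is_source_ingested(entry.source_id):
            yield entry
```

Because duplicates never enter the stream, the framework's per-batch `are_sources_ingested` queries could only ever return empty sets, which is why they were removed.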
@KRRT7 force-pushed the fix/remove-filter-ingested branch from 1c71230 to 3875789 on May 5, 2026 at 07:24
KRRT7 added a commit to KRRT7/typeagent-py that referenced this pull request May 5, 2026
The framework no longer populates messages_skipped (removed in microsoft#271),
so the skipped counter and conditional output are dead code.
@gvanrossum merged commit a058b2f into microsoft:main May 5, 2026
16 checks passed
bmerkle pushed a commit that referenced this pull request May 5, 2026
…or (#268)

## Summary

**Upfront bulk dedup** (commit 1):
- Replaces N per-file `is_source_ingested` DB queries with a single
upfront `are_sources_ingested` bulk query + in-memory `set` lookup
- `_email_generator` now takes a pre-collected `email_entries` list and
`already_ingested: set[str]`, skipping known files before parsing
(avoids wasted parse cost)
- Single `_iter_emails` call feeds both the bulk pre-filter and the
generator
- Removed dead `messages_skipped` tracking from batch callback (always 0
since #271)
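A minimal sketch of the upfront bulk dedup, assuming a hypothetical store whose `are_sources_ingested` returns the already-known subset in a single round-trip (the real `IStorageProvider` interface may differ):

```python
class BulkStore:
    # Stand-in store with a single bulk membership query.
    def __init__(self, ingested: set[str]) -> None:
        self._ingested = ingested
        self.query_count = 0  # track round-trips for illustration

    def are_sources_ingested(self, source_ids: list[str]) -> set[str]:
        self.query_count += 1
        return self._ingested & set(source_ids)

def split_new_and_known(
    source_ids: list[str], store: BulkStore
) -> tuple[list[str], set[str]]:
    # One upfront bulk query replaces N per-file is_source_ingested calls;
    # every later membership check is an in-memory set lookup, and known
    # files are skipped before any parsing work is spent on them.
    already = store.are_sources_ingested(source_ids)
    fresh = [s for s in source_ids if s not in already]
    return fresh, already
```

The key property is that `query_count` stays at 1 regardless of how many files are scanned, matching the "bulk query fires once before the generator runs" behavior in the test below.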

**Cleanup**:
- Removes `_flush_skipped`, skip tracking variables, unused
`IStorageProvider` import
- Net -49 lines

## Stack

**This PR is stacked on #267. Merge order: #271 → #267 → #268**

| # | PR | Branch | Description |
|---|---|---|---|
| 1 | #271 | `fix/remove-filter-ingested` | Remove framework dedup |
| 2 | #267 | `feat/vtt-streaming-ingestion` | VTT streaming migration |
| 3 | **#268** (this) | `refactor/email-dedup-consolidation` | Email bulk dedup |

## Reproducible test

First ingest (populates DB):
```
mkdir -p /tmp/test_eml
# create 2 .eml files (any valid RFC 5322 .eml will do)
uv run python tools/ingest_email.py /tmp/test_eml/ -d /tmp/dedup_test.db -v
```

Re-ingest (should skip all):
```
uv run python tools/ingest_email.py /tmp/test_eml/ -d /tmp/dedup_test.db -v
```

Expected output on re-ingest:
```
Pre-filter: 2 of 2 already ingested

Parsing and importing emails...

Successfully ingested 0 email(s)
Ingested 0 chunk(s)
Skipped 2 already-ingested email(s)
Extracted 0 semantic references
Total time: 0.0s
```

Key behavior: the bulk query fires once before the generator runs. No
emails are parsed on re-ingest (0.0s total time).

## Test plan

- [x] `make format check test` passes (696 tests, 0 pyright errors)
- [x] Manual test: re-ingest same .eml files, verify all skipped via
upfront bulk dedup
- [x] Manual test: ingest with `--verbose`, verify skip reporting in
summary

---------

Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>


Linked issue

_filter_ingested runs redundant DB queries when the caller already pre-filters
